Grants:IEG/Pan-Scandinavian Machine-assisted Content Translation

status: selected
Pan-Scandinavian Machine-assisted Content Translation
summary: This project will create more machine translation language data for use in Content Translation, filling in the blanks in the pairings between Swedish, Danish and Norwegian (Bokmål and Nynorsk).
target: Danish, Swedish, Norwegian Bokmål and Norwegian Nynorsk
strategic priority: increase reach
theme: tools
amount: 10,000 $
grantee: Unhammer
advisor: Francis_Tyers
contact: unhammer+apertium@mm.st
created on: 11:36, 18 September 2015 (UTC)

Project idea

What is the problem you're trying to solve?

Content Translation is a wonderful tool for translating articles between Wikipedias, and between some closely related language pairs it's possible to get good machine translation suggestions from Apertium. This has proven very helpful in creating new articles; users of the Spanish-Catalan pair in particular have been very enthusiastic about it.[1] However, for most such language pairs there is no Apertium machine translation data.

In particular, although there is Norwegian[2]→Danish and the intra-Norwegian pair Nynorsk↔Bokmål, there is no Danish→Norwegian, Swedish→Norwegian or Danish/Norwegian→Swedish ("Norwegian" here referring to both Nynorsk and Bokmål). This is unfortunate, since the resources are there; it just takes time and dedication to create the Apertium translators.

The lack of machine-assisted Content Translation ultimately means fewer localised articles. For example, a Norwegian user searching the web for a certain term might not even see that there is a great Swedish article on that subject, since search engines tend to prefer localised hits (and if not, the English version would likely win the search ranking due to its number of inbound links). Wikipedia readers thus become accustomed to searching in English, and never see the knowledge that exists in their neighbouring countries.

What is your solution?

Apertium is a free/open-source machine translation platform with a large community of FOSS contributors: over 300 people have committed changes since 2005, with 72 active committers in the last 12 months.[3] The Apertium project does not employ any staff, but sometimes has funding (typically through Google Summer of Code) to support limited contracts.

The project will use the existing Apertium MT resources as a basis for developing machine translation systems for the Norwegian/Danish/Swedish language pairs missing in Content Translation. We start with the ones requiring the least time to completion: first Danish→Norwegian (meaning both Nynorsk and Bokmål), then Swedish→Norwegian, and finally Danish→Swedish and Norwegian→Swedish.

Apertium machine translation packages are already integrated into Content Translation, so as soon as the MT data is created, they can be used to create new Wikipedia translations faster.
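
As a rough illustration of what "integrated" means in practice, the sketch below builds a request for APy (the Apertium server mentioned under Participants) and reads the translated text out of a reply. The host and port, the `nob`/`nno` language codes and the sample reply are assumptions made for illustration; the sample reply is hand-written, not captured from a real server, and this is not a description of how Content Translation itself calls the service.

```python
import json
from urllib.parse import urlencode

# Assumption: a locally running APy instance; host and port are illustrative.
APY_BASE = "http://localhost:2737"


def build_translate_url(source: str, target: str, text: str) -> str:
    """Build an APy /translate request URL for one language pair."""
    query = urlencode({"langpair": f"{source}|{target}", "q": text})
    return f"{APY_BASE}/translate?{query}"


def extract_translation(response_body: str) -> str:
    """Pull the translated text out of an APy-style JSON response."""
    payload = json.loads(response_body)
    if payload.get("responseStatus") != 200:
        raise ValueError(f"translation failed: {payload}")
    return payload["responseData"]["translatedText"]


url = build_translate_url("nob", "nno", "Jeg er ikke hjemme")

# Hand-written sample reply (hypothetical), in the shape APy returns.
sample = (
    '{"responseData": {"translatedText": "Eg er ikkje heime"}, '
    '"responseDetails": null, "responseStatus": 200}'
)

print(url)
print(extract_translation(sample))
```

Once a new pair such as Danish→Norwegian is released and installed, the same kind of request would presumably work with the corresponding language codes.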

Screenshot of Content Translation Bokmål→Nynorsk

Project goals

More machine-assisted Content Translation means that we can translate articles faster. This project ultimately aims to create more Wikipedia articles in the affected languages, so that readers get to see more articles in their own language and have access to more of the knowledge that exists in neighbouring Wikipedias (some of which might not even have been translated to the English Wikipedia yet, due to having more local relevance).

Project plan

Activities

The main developer will use existing Apertium resources as a basis for creating new language pairs, prioritising the pairs that are closest to being release-worthy.

A tentative work plan in order of priority:

  1. Create structural transfer rules for Danish→Norwegian, do frequency-based additions/modifications to the bilingual dictionary to ensure fluent translations are preferred. Danish already includes disambiguation rules, but these may need some fixes.
  2. Release Danish→Norwegian and ensure it's running in Content Translation.
  3. Expand Swedish Apertium dictionary with SALDO lexicon (http://spraakbanken.gu.se/eng/resource/saldo) and write disambiguation rules for Swedish.
  4. Create initial bilingual dictionary candidates by crossing dan-nor and swe-dan dictionaries, by generating translations through orthographic regularities, and by generating compounds. Manually check the candidate lists, sorted by frequency in Wikipedia articles.
  5. Create structural transfer rules and closed-class bilingual dictionary entries for Swedish→Norwegian.
  6. Release Swedish→Norwegian and ensure it's running in Content Translation.
  7. Create structural transfer rules for Danish→Swedish, do frequency-based additions/modifications to the bilingual dictionary to ensure fluent translations are preferred.
  8. Release Danish→Swedish and ensure it's running in Content Translation.
  9. Create structural transfer rules for Norwegian→Swedish, do frequency-based additions/modifications to the bilingual dictionary to ensure fluent translations are preferred.
  10. Release Norwegian→Swedish and ensure it's running in Content Translation.
  11. Gather and clean bilingual corpora for all pairs, train lexical selection models to improve fluency/word-choice, re-releasing pairs as training completes.
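
The dictionary-crossing idea in step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual tooling: the dictionaries are plain Python mappings rather than Apertium dictionary files, the function names are hypothetical, and the word lists and frequencies are toy data invented for the example.

```python
from collections import defaultdict


def cross_dictionaries(swe_dan, dan_nor):
    """Pivot through Danish: swe->dan plus dan->nor gives swe->nor candidates."""
    candidates = defaultdict(set)
    for swe, dan_words in swe_dan.items():
        for dan in dan_words:
            for nor in dan_nor.get(dan, ()):
                candidates[swe].add(nor)
    return candidates


def rank_by_frequency(candidates, swe_freq):
    """Order candidates so the most frequent source words are checked first."""
    return sorted(
        ((swe, sorted(nors)) for swe, nors in candidates.items()),
        key=lambda item: (-swe_freq.get(item[0], 0), item[0]),
    )


# Toy data, invented for illustration only.
swe_dan = {"flicka": {"pige"}, "hus": {"hus"}, "fönster": {"vindue"}}
dan_nor = {"pige": {"jente", "pike"}, "hus": {"hus"}, "vindue": {"vindu"}}
swe_freq = {"hus": 120, "flicka": 45, "fönster": 30}

ranked = rank_by_frequency(cross_dictionaries(swe_dan, dan_nor), swe_freq)
for swe, nors in ranked:
    print(swe, "->", ", ".join(nors))
```

Crossing can produce several candidates for one source word (here "flicka" maps to both "jente" and "pike"), which is one reason the plan includes a manual check of the frequency-sorted lists.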

Out-of-scope issues

Many issues are needed or could be useful but are, for various reasons, out of scope for this project; for example:

  • general improvements to the Content Translation tool (like logging corrections or allowing users to continue published translations) are handled by the Wikimedia Language Engineering team,
  • creating the Debian builds of MT packages used in Content Translation is handled by Tino Didriksen of Apertium and Kartik Mistry of Language Engineering,
  • deployment of packages and running the Content Translation server are handled by the Language Engineering team.

However, deployment issues specific to these language pairs are of course in scope for this project.

Budget

The estimated workload is about 3 full-time person-months for an experienced Apertium developer, or 6 person-months at 50 %. This estimate is based on the main developer's previous experience with similar projects. The total amount requested is 10,000 $.

Budget breakdown

| Item | Description | Commitment | Person-months | Cost |
|---|---|---|---|---|
| Main developer | Developing and releasing machine translation language pairs | Part time (50 %) | 6 | 12,500 $ |
| Co-funding | Confirmed co-funding by Apertium project | N/A | N/A | -2,500 $ |
| Total | | | | 10,000 $ |

The item costs are computed as follows: the main developer's gross salary (including 35 % Norwegian income tax) is estimated from the pay given to similar projects by the University of Tromsø, using Norwegian standard salaries,[4] the current exchange rate of 1 NOK = 0.1225 USD, and a quarter of a year's full-time work.
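
The arithmetic behind these figures can be checked with a short sketch. The requested total follows directly from the budget table; the NOK amounts are back-calculated from the quoted exchange rate and are not stated in the proposal, so they should be read as implied values rather than the actual salary-table figure.

```python
GROSS_COST_USD = 12_500   # main developer, 6 person-months at 50 %
CO_FUNDING_USD = 2_500    # confirmed Apertium co-funding
NOK_TO_USD = 0.1225       # exchange rate quoted in the proposal

# Amount requested from the grant after co-funding.
requested_usd = GROSS_COST_USD - CO_FUNDING_USD

# Back-calculated (assumption): NOK cost of a quarter year of full-time work,
# and the annual gross salary that would imply.
quarter_year_nok = GROSS_COST_USD / NOK_TO_USD
implied_annual_nok = quarter_year_nok * 4

print(requested_usd)
print(round(quarter_year_nok))
print(round(implied_annual_nok))
```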

Community engagement

By releasing early, we'll try to get Content Translation users to test the machine-assisted translation for their language pair, and get their feedback on what the most annoying errors of the system are (Content Translation may also get a feature to automatically store corrections, but that's not in place yet).

We'll also open pages where users themselves can contribute missing word/phrase translations and discuss the MT systems, and we'll be trying to get interested Wikipedians to get involved in Apertium language data development themselves (although there is a slight learning curve).

Sustainability

All Apertium code and data is free and open source. Packages exist for the newest Debian versions for both code and data (and the Apertium project has hired a package maintainer to ensure they don't go stale). The Content Translation system already uses these Apertium Debian packages.[5] So the Content Translation system should be able to use the code and data itself without having to worry about dependency problems or bit-rot.

Once language pairs are up to a stable point in Apertium, it is easy to add new words, and if we manage to recruit other developers they also should be able to keep the data expanding.

Although Wikipedia vocabulary and frequency lists would be used for development during this project, the free and open source translation data will also be useful on its own for unrelated projects (for bilingual dictionaries, as input to other translation systems, use in translation apps, and so on).

Measures of success

Intrinsic measures

These are the main milestones of the project:

  • Release of dan→nno/nob translator
  • Release of swe→nno/nob translator
  • Release of dan→swe translator
  • Release of nno/nob→swe translator

Each of these milestones should be a useful product to the Wikimedia communities of the target language.

The Release policy of Apertium gives some minimal requirements for what we can call a release; for this project we include the requirement of having the pair running in Content Translation.

Other measurable goals include:

  • the size of bilingual dictionary (released bilingual dictionaries are anywhere between 4034 and 55852 non-name entries, mean and median around 19000)
    • this would be for Swedish-Norwegian; for the other pairs a (reversible) bilingual dictionary already exists
  • number of transfer rules (between 20 and 50 is common for closely related language pairs).
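
One way the dictionary-size measure could be computed is sketched below. The sample is a tiny hand-written fragment in the style of an Apertium bilingual dictionary, and treating the `np` tag as the marker for names is an assumption about how "non-name entries" are counted; the figures quoted above may have been produced differently.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the style of an Apertium bilingual dictionary.
SAMPLE_BIDIX = """<dictionary>
  <section id="main" type="standard">
    <e><p><l>hus<s n="n"/></l><r>hus<s n="n"/></r></p></e>
    <e><p><l>flicka<s n="n"/></l><r>jente<s n="n"/></r></p></e>
    <e><p><l>Stockholm<s n="np"/></l><r>Stockholm<s n="np"/></r></p></e>
  </section>
</dictionary>"""


def count_non_name_entries(bidix_xml: str) -> int:
    """Count <e> entries whose left side is not tagged as a proper noun."""
    root = ET.fromstring(bidix_xml)
    count = 0
    for entry in root.iter("e"):
        left = entry.find("./p/l")
        if left is None:
            continue  # skip entries without a plain <p><l> pair
        tags = {s.get("n") for s in left.iter("s")}
        if "np" not in tags:  # assumption: "np" marks names
            count += 1
    return count


print(count_non_name_entries(SAMPLE_BIDIX))
```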

Extrinsic measures

In the long term, a measure of success would be an increase in the "translations to Nynorsk/Bokmål/Swedish/Danish" graphs.

Note that in particular, there are very few translations into Swedish (for which there is no MT support yet), while the MT-supported source languages are all among the top in their respective target languages in the "Translations into" table.[6]

Romance languages

We could try to compare the situation with the Romance languages, where there is more data. There has been a visible increase in the number of articles created per day in Catalan since Content Translation was introduced for Spanish→Catalan in July 2014, compared to the previous two or three years;[7] the mean for the 14 months since July 2014 is 103 articles per day, vs 77 for the preceding 14 months, a ratio of 1:1.33. For the other Romance languages, we see slight decreases in the number of new articles per day in the same period, regardless of whether they had MT-assisted Content Translation, but note that MT support for Portuguese and Spanish arrived 4 months later. The table below shows the before/after ratios for the periods around the introduction of MT support. The differences between Catalan and Spanish/Portuguese could be partly due to the fact that Catalan Wikipedia is the smallest of the three (about half the number of articles of PT, or a third of ES), and thus has more to gain from translation from the other two. Catalan also has a very high number of editors per speaker compared to Spanish and Portuguese.

| Period | CA (MT) | ES (MT) | PT (MT) | FR (no MT) | IT (no MT) |
|---|---|---|---|---|---|
| 14 months before/after July 2014 | 1:1.33 | N/A | N/A | 1:0.97 | 1:0.92 |
| 10 months before/after Nov 2014 | N/A | 1:0.98 | 1:1.00 | 1:1.01 | 1:0.89 |
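
The ratios in this table are of the form "1 : mean after / mean before". A minimal sketch of that calculation, using the rounded Catalan means quoted in the text, is shown here; the function name is hypothetical. Note that the rounded means give a value close to 1.34, so the 1:1.33 in the table was presumably computed from unrounded daily figures (an assumption, since the underlying data is not given).

```python
def before_after_ratio(mean_before: float, mean_after: float) -> str:
    """Express the change as 1:x, where x = mean_after / mean_before."""
    return f"1:{mean_after / mean_before:.2f}"


# Rounded Catalan means quoted in the text (new articles per day).
catalan = before_after_ratio(77, 103)
print(catalan)
```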

Scandinavian Wikipedias size

Comparing sizes for the Scandinavian Wikipedias is not trivial. For example, as of 2015-09-23, Swedish Wikipedia has nearly 2 million articles, but over 1.3 million were created by bots,[8] while Norwegian Bokmål has under half a million articles because of tight authorisation of bots, but more editors per million speakers. Danish has about half as many articles as Bokmål, and half the number of editors per speaker. Thus none of these languages is exactly in the situation of Catalan (fewer articles than the source languages, but more editors per speaker), although it seems likely that the number of editors per speaker would be the best predictor of getting more Content Translations after we add MT.

| | Swedish[9] | Bokmål[10] | Nynorsk[11] | Danish[12] |
|---|---|---|---|---|
| articles | 1,993,354 | 418,062 | 121,929 | 208,983 |
| editors per million speakers | 66 | 75 | 8† | 35 |
| "active" editors | 658 | 354 | 36 | 212 |
| bot-created articles | 1.4 M[8][13] | 8.5k[13] | 1.8k[13] | 6,736[14]–12k[13] |

† The stats summary pages giving editors per million speakers for Bokmål/Nynorsk are misleading, since both Bokmål and Nynorsk users come from the same pool of Norwegian speakers. Estimates for Nynorsk users lie at around 10-15 % of the population, so normalised numbers are quite similar for Bokmål and Nynorsk.
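
The normalisation described in this footnote can be sketched as follows. It assumes that the published figures divide by the whole pool of Norwegian speakers and that Bokmål users make up the remainder of that pool; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def normalise(editors_per_million: float, share_of_speakers: float) -> float:
    """Rescale a per-million figure to the share of speakers using that form."""
    return editors_per_million / share_of_speakers


# Figures from the table; shares follow the 10-15 % Nynorsk estimate above.
for nynorsk_share in (0.10, 0.15):
    nynorsk = normalise(8, nynorsk_share)
    bokmal = normalise(75, 1 - nynorsk_share)
    print(f"share {nynorsk_share:.0%}: Nynorsk {nynorsk:.0f}, Bokmål {bokmal:.0f}")
```

Under these assumptions the Nynorsk figure rises from 8 to somewhere between roughly 53 and 80 editors per million users, against roughly 83 to 88 for Bokmål, which is far closer than the raw numbers suggest.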

Get involved

Participants

  • Unhammer – Main developer. Self-employed, with an MSc in Computational Linguistics and Natural Language Processing. I am a volunteer in the Apertium project. I developed the Nynorsk-Bokmål Apertium translator[15] funded by Google Summer of Code 2009, where we used Wikipedia text for part-of-speech tagging, contributions through Wikipedia for expanding the lexicon, and Wikipedia translations for evaluation. I've since worked on Northern Saami→Bokmål through the University of Tromsø, and have mentored in my spare time for Google Summer of Code since 2010 (including the Norwegian→Danish translator). I've also worked on Apy, the server that's used in (among others) Content Translation to manage Apertium translators.
  • Francis Tyers – Advisor. Long-time Apertium developer and Wikipedia contributor, Doctor of Machine Translation, currently working at UiT Norgga árktalaš universitehta.


Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • This project would expand on existing [open-source!] infrastructure to fill a useful gap. Disclosure: I develop Apertium resources for Turkic languages. —Firespeaker (talk) 20:13, 22 September 2015 (UTC)
  • This would be a useful project - having worked with Apertium, I'd say it'd help build several pretty valuable (and potentially rather high quality) open-source translation pairs. —Vinivars
  • This is a great idea. We already have so many Scandinavian resources and working language pairs at Apertium (which I am part of) that taking these 'last steps' to get MT working across Scandinavian languages makes sense. This has the potential to be really beneficial for Wikimedia's Content Translation. -Jn0101 (talk)
  • This would indeed be a great grant project for the Scandinavian-language Wikipedias. I'm all for it. Jon Harald Søby (talk) 12:03, 23 September 2015 (UTC)
  • I just want to say that we have an ongoing open collaboration with Apertium, we know who they are, we meet in different contexts... At least from a strategic point of view, funding a project in this context would be sensible.--Qgil-WMF (talk) 12:04, 23 September 2015 (UTC)
  • I'm all for this! On the trends on machine translation from Nynorsk into Bokmål, note that there have been some issues blocking proper translation, but those seem to be solved now. I am a long-time contributor in the Norwegian community and probably somewhat biased on the prospect of getting (better) language pairs for the Scandinavian languages. -- Jeblad 12:20, 23 September 2015 (UTC)
  • I think this would be very useful for all the Scandinavian projects. I strongly support this. --Tarjeimo (talk) 13:04, 23 September 2015 (UTC)
  • Go for it! Kimsaka (talk) 13:05, 23 September 2015 (UTC)
  • I'm not familiar with Apertium, but the basic idea is very attractive and very helpful for translations between the Scandinavian languages. I'm in doubt about "nynorsk" - as a subject for the project as it largely differs from norsk (bokmål), svensk and dansk.--Ramloser 16:03, 23 September 2015 (UTC)
    • This is not correct, they are quite similar to Norwegian Nynorsk. I've translated the article w:no:Tonsåsen (Norwegian Bokmål) into w:nn:Tonsåsen (Norwegian Nynorsk) with ContentTranslation and found four (4) errors due to Apertium, and some more due to ContentTranslation. There are still bugs to weed out (some templates get really messed up), but ContentTranslation with Apertium is starting to be a really good tool. -- Jeblad 15:31, 23 September 2015 (UTC)
    • Hi there, thanks for the feedback, Ramloser. If you're not familiar with the details of the written forms of Norwegian, I recommend w:en:Languages_of_Norway#Norwegian as a starting point. --Unhammer (talk) 07:12, 24 September 2015 (UTC)
  • I would certainly like to use ContentTranslation for translating articles about Swedish matters from Swedish to Norwegian. And the language resources benefit not just the Wikipedias, but other projects as well, which is great. Danmichaelo (talk) 15:05, 23 September 2015 (UTC)
  • I didn't read the whole thing so don't count this as an endorsement, but if someone can reach the stated goal that's Unhammer and Spectre. As a wikimedian I'm flattered they even considered contributing here. Nemo 17:40, 23 September 2015 (UTC)
  • I'm all for this! Chairman Wikimedia Norway Hogne (talk) 20:51, 23 September 2015 (UTC)
  • To me it seems like a good project to support. I know from own experience that almost any article will need proof-reading, translations not excepted. But that job will become much easier, so I do like the idea. User:Bjørn som tegner
  • I write on Swedish WP and have discussed this myself with other users in Sweden and Denmark recently, as much work could be saved and many more articles could be added, especially those about each country and its places/people/arts etc. I think one should organize a group of interested writers to assist in translating/adjusting within the Nordic countries as a start. Bemland (talk) 23:21, 23 September 2015 (UTC)
  • It seems like an eminently sensible idea to develop this tool for these language pairs. Vinguru (talk) 04:48, 24 September 2015 (UTC)
  • Any reuse of text between the Swedish, Danish and the two Norwegian language Wikipedias involves a time consuming manual editing word by word of small systematic differences in spelling and vocabulary. A good open source Apertium translation will become a huge time saver for everyone who would like more cooperation and exchange between the Scandinavian language communities. Unhammer has a track record of developing MT in the leading Open Source engine for translation of closely related languages. H@r@ld (talk) 08:08, 25 September 2015 (UTC)
  • Very useful idea! Petter Bøckman (talk) 15:11, 26 September 2015 (UTC)
  • Support! --- Løken (talk) 22:58, 26 September 2015 (UTC)
  • Looks like an obvious place to begin. Palnatoke (talk) 07:22, 27 September 2015 (UTC)
  • Talented people. The proposed work benefits a group of languages and can make them all stronger. These languages are active, so I believe they will start taking advantage of the machine translation as soon as it is available. It requires very little effort from the Language Engineering team to integrate this work in Content Translation. The results will also be useful for translation activities using the Translate extension on most multilingual Wikimedia projects plus translatewiki.net. Nikerabbit (talk) 08:38, 27 September 2015 (UTC)
  • Useful tool; it is always good to strengthen interaction between similar platforms. It means better quality in general and therefore better quality for other languages, whatever language we end up translating from. --Alexmar983 (talk) 19:11, 27 September 2015 (UTC)
  • Unhammer is a very active Apertium developer; I have already worked with his packages in Debian, and all his work is of high quality. The proposal looks very good and well planned. It will be beneficial to a number of languages and Wikipedias that use machine translation through Content Translation. KartikMistry (talk) 09:46, 30 September 2015 (UTC)

References

  1. https://blog.wikimedia.org/2015/04/06/content-translation-improved-my-edits/
  2. When using the term "Norwegian", both Nynorsk and Bokmål forms are meant.
  3. https://www.openhub.net/p/apertium
  4. https://www.regjeringen.no/no/dokumenter/lonnstabeller/id438643/#foerti - lønnstrinn 47
  5. https://blog.wikimedia.org/2014/11/14/apertium-and-wikimedia-a-collaboration-that-powers-the-content-translation-tool/
  6. Last table of https://en.wikipedia.org/wiki/Special:ContentTranslationStats – see rows for Nynorsk into Bokmål, Bokmål into Nynorsk, Bokmål (sometimes listed as "norsk") and Swedish into Danish.
  7. See stats for Catalan, Spanish, French and Italian under "New articles per day" 2013-2015
  8. Summing date-ordered categories of https://sv.wikipedia.org/wiki/Kategori:Robotskapade_artiklar – see also https://blog.wikimedia.org/2013/06/17/swedish-wikipedia-1-million-articles/
  9. https://stats.wikimedia.org/EN/SummarySV.htm
  10. https://stats.wikimedia.org/EN/SummaryNO.htm
  11. https://stats.wikimedia.org/EN/SummaryNN.htm
  12. https://stats.wikimedia.org/EN/SummaryDA.htm
  13. stats:EN/BotActivityMatrixCreates.htm
  14. https://da.wikipedia.org/wiki/Kategori:Bot-oprettede_artikler
  15. Unhammer, Kevin; Trosterud, Trond. "Reuse of free resources in machine translation between Nynorsk and Bokmål". In: Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation / Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis M. Tyers. Alicante : Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos, 2009, pp. 35-42